Squaring circuit

ABSTRACT

Methods, apparatuses, and computer program products for squaring an operand include identifying a fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value, wherein the fixed-point value comprises a binary bit string. A square of the fixed-point value can be determined using the fixed-point value, the substring size, and least significant bits of the fixed-point value equal to the substring size.

FIELD

The present disclosure is directed to a squaring technique that can be implemented as a circuit or as a software algorithm, and more particularly, a squaring technique that uses an arbitrary radix number system.

BACKGROUND

Squaring is an arithmetic operation used in many digital systems. Squaring circuits can be used for digital signal processing applications, such as image compression, pattern recognition, and others. Squaring is also used as an atomic computation for some cryptography algorithms. Squaring circuit architecture is also commonly incorporated in graphics processors. Several general purpose multiplier circuit designs have also been proposed based on squaring of input operands.

SUMMARY

Certain aspects of the present disclosure pertain to methods, circuit elements, and computer program products for squaring a value. A fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value can be identified, wherein the fixed-point value comprises a binary bit string. A square of the fixed-point value can be determined using the fixed-point value, the substring size, and least significant bits of the fixed-point value equal to the substring size.

In some implementations, a square can be determined by iteratively determining squares of substrings of the fixed-point value using least significant bits of each operand equal to the substring size and the substring of the fixed-point value, wherein the operand in each iteration comprises a portion of the previous operand, wherein the operand is formed by decatenating the previous operand least significant bits equal to the substring size.

In some implementations, determining a square of a fixed-point value can include identifying the fixed-point value as an operand. A substring of the operand can be determined as the least significant bits of the operand where the substring is of a specified substring size. The substring can be decatenated from the operand to form a word. The substring can be squared using the word, the substring, and the substring size. The square of the substring can be added to a result. If a length of the word is greater than zero, the word can be identified as the operand and the determining, decatenating, squaring, and adding steps can be executed. If the length of the word and substring is zero, one more iteration is undertaken to account for non-zero residual values, and the result is identified as the square of the fixed-point value.

In some implementations, the following expansion can be calculated:

${\alpha^{2} = {{\left( \frac{A}{\beta} \right)^{2}\beta^{2}} + {\left( \frac{A}{\beta} \right)\beta^{2}} + \left( \frac{\beta}{2} \right)^{2} + {2\left( {A + \frac{\beta}{2}} \right)b} + b^{2}}},$

where A is the word, β is the radix, the substring size is log₂(β), and b is the substring value minus β/2.

The details of one or more embodiments of the disclosure are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims. For example, hardware-based squaring circuits, such as those described here, can accommodate the increasing demand for cryptography hardware support in low power, high-speed mobile devices.

DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of an example landscape that includes a device with a squaring circuit in communication with a network.

FIG. 2 is a schematic block diagram of an example squaring circuit in accordance with the present disclosure.

FIGS. 3A-3C are example diagrams of a squaring circuit operating on a six bit string using a two bit substring.

FIGS. 4A-4B are example diagrams of a squaring circuit operating on a six bit string using a three bit substring.

FIG. 5A is an example of a portion of a process flow diagram for squaring an input value in accordance with the present disclosure.

FIG. 5B is an example of another portion of the process flow diagram for squaring an input value in accordance with the present disclosure.

DETAILED DESCRIPTION

The present disclosure describes an iterative squaring technique that produces a 2nm-bit result, α², based on an input operand (often referred to as a squarand) α of nm bits in length. The circuit produces 2m bits of the output α² during each iterative step. By considering an m-bit grouping within the squarand α as representing a single radix-2^(m) digit, the circuit can be considered a digit-serial implementation that produces two m-bit digits per iteration.

This digit-serial architecture may allow for a tradeoff between bit-serial and parallel architectures by allowing for the digit to be represented by m bits. Because 2m bits of the result are computed in each iterative step, varying m can yield more or less parallelism while inversely affecting required circuit area. Thus, a minimal or otherwise reduced area circuit can be realized when m is small (bit-serial for the case m=1) and a large parallel circuit results at the other extreme when m is set to the wordsize of the squarand. Designers may be able to choose an appropriate value of m such that performance requirements are met while minimizing or otherwise reducing the amount of circuitry required.

Arithmetically, the technique assumes the squarand is represented as a higher-radix digit string where each digit is represented by an m-bit substring. Furthermore, the technique may yield two digits of output squared value during each iterative step; hence, a total of 2m bits of the squared result are computed at each iterative step.

FIG. 1 is a schematic diagram of an example landscape 100 that includes a device 102 having a squaring module 106 in accordance with the present disclosure in communication with a network 104. The device 102 may be any type of computing device, such as a personal computer, a touch screen terminal, a workstation, a network computer, kiosks, wireless data ports, wireless or wireline phones, smartphones, personal data assistants (PDAs), one or more processors within these or other devices, or any other suitable processing device, to execute operations associated with squaring algorithms. For example, device 102 may be a PDA operable to wirelessly connect with a network 104. In another example, device 102 may be a laptop or tablet computer that includes an input device, such as a keypad, touch screen, mouse, or other device that can accept information, and an output device that conveys information, including digital data, visual information, or graphical user interface. Device 102 may also be a server that can execute operations using input data received from other devices and can send results of operations to other devices across network 104.

The device 102 includes a squaring module 106. The squaring module 106 (described in more detail in FIG. 2) receives as an input a squarand α 108 and a value m 110 that indicates the substring bit length for the squaring operation. The squaring module 106 outputs a result α² 112. The squarand 108 and the substring bit length 110 may be received locally through an input device of device 102, or may be received from a device across network 104. The result 112 may be displayed to a user of device 102 on a local display or graphical user interface. In some implementations, the result 112 can be transmitted to another device across network 104.

Network 104 facilitates wireless or wireline communication between device 102 and other devices. Network 104 may be all or a portion of an enterprise or secured network. In another example, network 104 may be a VPN between device 102 and other devices across a wireline or wireless link. Such an example wireless link may be via 802.11a, 802.11b, 802.11g, 802.11n, 802.20, WiMax, and many others. The wireless link may also be via cellular technologies such as 3GPP GSM, UMTS, LTE, etc. While illustrated as a single or continuous network, network 104 may be logically divided into various sub-nets or virtual networks without departing from the scope of this disclosure, so long as at least a portion of network 104 may facilitate communications between senders and recipients of requests and results. In other words, network 104 encompasses any internal and/or external network, networks, sub-network, or combination thereof operable to facilitate communications between various computing components in landscape 100. Network 104 may communicate, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and other suitable information between network addresses. Network 104 may include one or more local area networks (LANs), radio access networks (RANs), metropolitan area networks (MANs), wide area networks (WANs), all or a portion of the global computer network known as the Internet, and/or any other communication system or systems at one or more locations.

The following notation may be used in the description of the digit-serial fixed-point squaring algorithm:

β represents the radix or base of a number system. β may be in the set of natural numbers, β ∈ ℕ.

The ‘radix polynomial’ form of a value α is written as an n-term polynomial of the form: α=a_(n−1)β^(n−1)+a_(n−2)β^(n−2)+ . . . +a₂β²+a₁β¹+a₀β⁰

A value α can also be represented in the radix-β number system in the form of a positional string of n characters denoted by α=[a_(n−1) a_(n−2) . . . a₂ a₁ a₀]. For clarity, the character strings denoting the positional digit representations of a value α may be enclosed by square brackets. The digits a_(i) are the coefficients of the radix-polynomial form and their position within the string inherently denotes the exponent of the radix β.

Each character a_(i) in a positional string representing a value is referred to as a “digit” regardless of the radix of the number system. Binary digits may alternatively be referred to as “bits.”

Digits are restricted to the natural numbers when β≦10, and are members of the set: {a_(i) ∈ ℕ | 0≦a_(i)≦β−1}. For the case where β>10, alternative single characters are used to represent a digit such as the characters “A” through “F” for the case of β=16.

Where necessary for clarity, digit strings are subscripted by the radix β of the particular number system being used, α=[a_(n−1) a_(n−2) . . . a₂ a₁ a₀]_(β).

LSD(α,k) and MSD(α,k) are operators that yield k least significant or most significant digits, respectively, in the digit string representing a value α. LSD(α,1) represents the least significant digit of α, LSD(α,1)=a₀. Likewise the most significant digit is given as MSD(α,1)=a_(n−1).

{A,B,C} denotes concatenation of the content of registers A, B, and C which can be of any size and whose individual sizes may differ.

SHL(A,k,B) denotes the operation of shifting the content of register A to the left by k bits and setting the least significant k bits to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.

SHR(A,k,B) denotes the operation of shifting the content of register A to the right by k bits and setting the most significant k bits of A to the content of register B. A can be of any size greater than or equal to the size of B and B must be of size k.

A←B denotes the operation of setting the content of register A with that of register B. A and B can be the same size in some implementations.

The radix-β value A is defined as A=α−a₀. Expressed as a positional n-digit string: A=[a _(n−1) a _(n−2) . . . a ₂ a ₁0]_(β). Thus, A can be formed by replacing LSD(α,1)=a₀ with the zero digit [0]_(β) or as: A={SHR([α−0]_(β),1,[0]_(β)),[0]_(β)}.
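For illustration only, the operators defined above may be modeled in software. The following Python sketch is not part of the disclosed circuit; it assumes that registers are represented as non-negative integers with an explicit bit width and that digit strings are lists with the most significant digit first. The function names (lsd, msd, concat, shl, shr, form_a) are hypothetical.

```python
# Hypothetical software model of the notation. The names and the
# (value, width) register representation are illustrative assumptions.

def lsd(digits, k):
    """LSD(alpha, k): the k least significant digits (list is MSD first)."""
    return digits[len(digits) - k:]

def msd(digits, k):
    """MSD(alpha, k): the k most significant digits."""
    return digits[:k]

def concat(*fields):
    """{A,B,C}: join (value, width) fields, first field most significant."""
    value, width = 0, 0
    for v, w in fields:
        value = (value << w) | (v & ((1 << w) - 1))
        width += w
    return value, width

def shl(a, width, k, b):
    """SHL(A,k,B): shift A left by k bits, low k bits taken from B."""
    mask = (1 << width) - 1
    return ((a << k) & mask) | (b & ((1 << k) - 1))

def shr(a, width, k, b):
    """SHR(A,k,B): shift A right by k bits, high k bits taken from B."""
    return (a >> k) | ((b & ((1 << k) - 1)) << (width - k))

def form_a(digits):
    """A = alpha - a0: replace the least significant digit with zero."""
    return digits[:-1] + [0]
```

In this model, shr(0b1011, 4, 2, 0b11) shifts a four-bit register right by two bits and fills the two vacated most significant bits from the second register.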

The present disclosure describes a circuit and algorithm such that the choice of radix β allows for a trade-off in logic circuit area versus throughput performance in the computation of α² when α is represented as a binary bit string. Higher values of β allow more bits to be produced per iterative step in the resulting representation of α². A tradeoff occurs in that the amount of computation or logic required at each iterative step increases for higher radix values.

In the basis of the algorithm as stated here, it is assumed that the squarand is of the form of a binary bit string. Intermediate computations can be efficiently implemented when the radix β is in the form β=2^(m) where m is a positive integer m≧2. Efficiency results since β=2^(m) allows each higher radix digit in the string representing α to be equivalent to an m-bit substring within α. α, in terms of a higher-radix digit string, is simply the concatenation of the disjoint m-bit substrings of α in binary form where LSD(α,1) is the least significant m bits, the subsequent next significant higher-radix digit is represented by the next group of m bits to the left of LSD(α,1), and so on.
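The correspondence between m-bit substrings and radix-2^(m) digits may be sketched as follows. This Python fragment is illustrative only; the helper names to_digits and from_digits are hypothetical, and the squarand is assumed to be held as a non-negative integer of nm bits.

```python
def to_digits(alpha, n, m):
    """Split an n*m-bit value into n radix-2**m digits, MSD first."""
    mask = (1 << m) - 1
    return [(alpha >> (m * i)) & mask for i in reversed(range(n))]

def from_digits(digits, m):
    """Evaluate the radix polynomial by concatenating m-bit digits."""
    value = 0
    for d in digits:
        value = (value << m) | d
    return value
```

For example, the six-bit string [101101]₂ is read as three radix-4 digits when m=2 and as two radix-8 digits when m=3.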

For convenience in specifying the basis of the algorithm, Equation (1) can be written with the restriction that β=2^(m) and some of the individual terms on the right-hand side of the equation can be denoted as T₁, T₂, and T₃. α² can be written as:

$\alpha^{2} = \left( \frac{A}{\beta} \right)^{2} 2^{2m} + \left( \frac{A}{\beta} \right) 2^{2m} + \left( \frac{\beta}{2} \right)^{2} + 2\left( A + \frac{\beta}{2} \right) b + b^{2} = \left( \frac{A}{\beta} \right)^{2} 2^{2m} + T_{1} + T_{2} + T_{3} \qquad (1)$

The terms T₁, T₂, and T₃ are explicitly defined as follows:

$T_{1} = \left( \frac{A}{\beta} \right) 2^{2m} + \left( \frac{\beta}{2} \right)^{2};$

$T_{2} = 2\left( A + \frac{\beta}{2} \right) b;$

$T_{3} = b^{2}$
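For reference, the expansion may be derived in a few lines from the definitions A=α−a₀ and b=a₀−β/2 used in this description. The derivation below is a reader's aid rather than part of the algorithm statement, and uses β²=2^(2m).

```latex
\begin{aligned}
\alpha &= A + a_{0} = \left(A + \tfrac{\beta}{2}\right) + b,
  \qquad b = a_{0} - \tfrac{\beta}{2} \\
\alpha^{2} &= \left(A + \tfrac{\beta}{2}\right)^{2}
  + 2\left(A + \tfrac{\beta}{2}\right) b + b^{2} \\
&= A^{2} + A\beta + \left(\tfrac{\beta}{2}\right)^{2}
  + 2\left(A + \tfrac{\beta}{2}\right) b + b^{2} \\
&= \left(\tfrac{A}{\beta}\right)^{2}\beta^{2}
  + \underbrace{\left(\tfrac{A}{\beta}\right)\beta^{2}
  + \left(\tfrac{\beta}{2}\right)^{2}}_{T_{1}}
  + \underbrace{2\left(A + \tfrac{\beta}{2}\right) b}_{T_{2}}
  + \underbrace{b^{2}}_{T_{3}}
\end{aligned}
```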

The idea behind the algorithm may be to compute terms T₁, T₂ and T₃ during each iterative step and accumulate them with the previous result. Subsequent iterations use A/β from the (A/β)² term in Equation (1) as a squarand. The subsequent operand A/β for each iterative step is a digit string containing one less digit than the squarand in the previous step indicating that the iterative algorithm requires O(n/m) iterations to complete. The 2^(2m) shifting factor of the first term in Equation (1) illustrates the fact that two digits (2m length bitstrings) are produced at each step and they represent digits in α² that are produced in the order of the lesser significant digits first.
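A single iterative step may be checked numerically. The Python sketch below is illustrative only: the function name step_terms is hypothetical, the squarand is assumed to be a non-negative integer, and ordinary signed integer arithmetic stands in for whatever signed-digit encoding a circuit would use.

```python
def step_terms(alpha, m):
    """Return (A/beta, T1, T2, T3) for one step with radix beta = 2**m."""
    beta = 1 << m
    a0 = alpha & (beta - 1)          # least significant radix-beta digit
    a_over_beta = alpha >> m         # remaining digits, i.e. A/beta
    b = a0 - beta // 2               # signed residual
    t1 = (a_over_beta << (2 * m)) + (beta // 2) ** 2
    t2 = 2 * ((a_over_beta << m) + beta // 2) * b
    t3 = b * b
    return a_over_beta, t1, t2, t3
```

For α=45 and m=2, the call returns A/β=11, T₁=180, T₂=−92, and T₃=1, and 11²·2⁴+180−92+1=2025=45².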

Several observations may be used to more efficiently implement the computation of the three terms T₁, T₂, and T₃ in the squaring algorithm. First, the term A/β may be efficiently obtained by shifting the digit string representing α one position to the right and discarding a₀, A/β=[a_(n−1) a_(n−2) . . . a₂ a₁]_(β). Second, values that are multiplied by a factor of β^(k)=2^(km) may be easily obtained by shifting the value to the left by km bit positions and inserting a radix-β zero digit place holder [0]_(β) for the vacated least significant digits. Third, the term β/2 is always of the form of a single radix-β digit. Expressed as an m-bit binary string β/2=[10 . . . 0]₂. Finally, the term (β/2)² is always of the form of two radix-β digits with the most significant digit of value β/4 and the least significant digit of value zero. Hence, expressed as a 2m-bit binary string, (β/2)²=[010 . . . 0]₂.

Term T₁ can be computed in a single operation. Making use of the first and second observations, the value (A/β)2^(2m) is obtained by forming the digit string [a_(n−1) a_(n−2) . . . a₂ a₁00]_(β). Furthermore, based on the fourth observation, (β/2)² can always be expressed as two radix-2^(m) digits (2m bits) denoted as [q₁q₀]_(β). Thus, T₁ is obtained by forming the string [a_(n−1) a_(n−2) . . . a₂ a₁ q₁ q₀]_(β). From the fourth observation, q₁=β/4 and q₀=0 so that (β/2)²=[q₁ q₀]_(β)=[(β/4)0]_(β). Thus, the digit string representation for T₁ is [a_(n−1) a_(n−2) . . . a₂ a₁(β/4)0]_(β).
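The formation of T₁ by concatenation alone may be sketched as follows. The function name t1_by_concatenation is hypothetical, and the sketch assumes m≥2 so that β/4 is an integer digit.

```python
def t1_by_concatenation(a_over_beta, m):
    """Form T1 = [A/beta digits, beta/4, 0] in radix 2**m without an adder."""
    q1 = (1 << m) >> 2               # digit beta/4
    q0 = 0                           # zero digit
    return (((a_over_beta << m) | q1) << m) | q0
```

Because the two appended digits occupy bit positions left vacant by the shift, the bitwise OR gives the same value as the arithmetic sum (A/β)2^(2m)+(β/2)².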

Term T₂ is computed by first forming a digit string representing 2(A+β/2) and then multiplying this string with the single radix-β digit b. Relying on the first, second, and third observations, A=[a_(n−1) a_(n−2) . . . a₂ a₁ 0]_(β) and β/2 may be represented as a single unsigned radix-2^(m) digit (m-bit string). Therefore, (A+β/2)=[a_(n−1) a_(n−2) . . . a₂ a₁(β/2)]_(β). To account for the multiplicative factor of 2, the (A+β/2)=[a_(n−1) a_(n−2) . . . a₂ a₁(β/2)]_(β) digit string is then shifted by one bit position to the left resulting in 2(A+β/2). The multiplicative factor 2 would in general be implemented through the use of an addition operation, 2(A+β/2)=(A+β/2)+(A+β/2), when a higher-valued radix β is used that is not an integral power of two since this can be considered a “fractional digit shift,” if β≠2^(m).
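The first stage of the T₂ computation, forming 2(A+β/2) by concatenation followed by a one-bit shift, may be sketched as follows. The function name twice_a_plus_half_beta is hypothetical and the sketch assumes β=2^(m).

```python
def twice_a_plus_half_beta(a_over_beta, m):
    """Form 2(A + beta/2): append the digit beta/2, then shift left one bit."""
    b2 = 1 << (m - 1)                     # digit beta/2 = [10...0]
    a_plus = (a_over_beta << m) | b2      # [a_(n-1) ... a_1 (beta/2)]
    return a_plus << 1                    # multiply by 2
```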

The final step in the formation of term T₂ involves the multiplication of 2(A+β/2), the one-bit left shift of [a_(n−1) a_(n−2) . . . a₂ a₁(β/2)]_(β), by the signed single radix-2^(m) digit b=a₀−β/2. Because b is a single digit value, this multiplication may be accomplished with a minimal or reduced amount of computation or circuitry as compared to a general purpose multiply operation or circuit. Clearly, as the value m is increased resulting in a higher valued radix, 2^(m), both computational complexity and overall algorithm throughput may increase. The actual implementation of the multiplication by b may be dependent upon the value m and may be carefully considered for a given realization of the algorithm. Relatively small values of m generally allow for a simple logic circuit or lookup table to be used.

Term T₃=b² relies on the computation of the square of the residual value b. The implementation of this computation may also be dependent upon the size of m, which dictates the number of bits required to represent a radix-2^(m) digit. For smaller values of m, the direct calculation of b² can be very efficiently implemented as a small combinational logic circuit or through a lookup table. As m increases, the computation of b² becomes more complex and other methods may be employed.
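A lookup-table realization of T₃ for small m may be sketched as follows. The function name build_square_lut is hypothetical; the table is assumed to be addressed directly by the m-bit digit a₀, which serves as the encoding of b=a₀−β/2.

```python
def build_square_lut(m):
    """Table of b*b indexed by the m-bit digit a0, where b = a0 - beta/2."""
    half = 1 << (m - 1)
    return [(a0 - half) ** 2 for a0 in range(1 << m)]
```

For m=2 the table has four entries, and for m=3 it has eight; the largest entry is (β/2)², which fits in 2m bits.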

For large values of m, the computation of T₃=b² can be accomplished in parallel with the computation of the other two terms T₁ and T₂ since accumulation of T₁+T₂+T₃ with the overall result can occur at the end of each iterative step.

After terms T₁, T₂, and T₃ are formulated, they are summed together and accumulated with the previous result. The accumulation takes into account the process of multiplying subsequent iterative operands by 2^(2m) and the fact that two independent radix-β digits (or, 2m bits) of the final result are produced at each iterative step. This can be implemented in a variety of ways, including using registers. The size of the register may be 2nm bits where n is the number of radix-β digits representing α and m denotes the number of bits per digit (β=2^(m)). The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of T₁+T₂+T₃ into the most significant positions of the shifted result register. Insertion of the two radix-2^(m) digits in the most significant portion of the result register instead of performing a multi-bit left shift before adding them to the previously accumulated result allows the algorithm to be implemented without the need for an inclusion of a multi-bit left shift operation or the use of a barrel shifting circuit in a hardware realization.
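The result-register update may be sketched as follows. The function name update_result is hypothetical, and the register is modeled as a non-negative integer of 2nm bits.

```python
def update_result(r, acc, n, m):
    """Shift R right by 2m bits and insert the low 2m bits of acc at the top."""
    width = 2 * n * m
    low = acc & ((1 << (2 * m)) - 1)          # two radix-2**m digits
    return (r >> (2 * m)) | (low << (width - 2 * m))
```

After n such updates the digits inserted first have migrated to the least significant end of the register, so no left shift by a variable amount is needed.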

The algorithm uses an iteration index i to determine if all digits of the squarand have been produced. For an n-digit radix-β squarand, the squared result consists of 2n digits. Because two digits are produced per iterative step, the index i ranges from zero to (n/2)−1. Initially, when i=0, α is the original squarand. During intermediate computations, when 0<i<n/2, the algorithm iterates and sets the intermediate squarand α=A/β. In the final iterative step, the squarand argument becomes α=0; however, this step is performed to account for circumstances when the residual b is not zero-valued.

Any given implementation of the algorithm should include careful consideration of the manner in which the signed digit b is represented. When explicitly represented using a radix-complement or a signed-magnitude form, m+1 bits are required to account for the sign. Furthermore, depending upon the definition of the residual, b can take on integer values in either of the ranges [−(β/2),(β/2−1)] (as is the case in this formulation) or [−(β/2)+1,(β/2)]. However, because there is a one-to-one relationship between the a₀ and b values (since b=a₀−β/2), the m-bit string representing a₀ can be used as an encoding for the corresponding b value.

The algorithm formulated in the previous section makes use of several registers. For succinctness, the registers used within the algorithm statement are defined in Table 1, shown below:

TABLE 1 Registers Used in Squaring Algorithm

name    size (bits)    content
AB      (n − 1)m       A/β
R       2nm            α²
i       log₂(n/m)      iteration index
B       m              residual b encoded as LSD(α, 1)
ACC     2nm            T₁ + T₂ + T₃
T1      2nm            T₁
T2      2nm            T₂
T3      2nm            T₃
B2      m              β/2
B4      2m             (β/2)² = (β²/4) = [(β/4)0]_(β)

A statement of the algorithm is given below. Intermediate locations within the algorithm are denoted by labels in the form “STEP k.” The labels are included for convenience in referring to certain portions of the algorithm and they also indicate clock boundaries in that the results of STEP k−1 are registered before computation occurs in STEP k. As an example, the T2←{AB,B2} operation of STEP 3 must complete before the T2←SHL(T2,1,[0]₂) operation of STEP 4 can proceed. Breaking up the computation of term T₂ into multiple intermediate registered operations is an example of pipelining the datapath and allows for the overall circuit clock speed to be increased. The steps are described below:

INPUT:
  α: nm-bit fixed-point squarand
  m: log₂(β)-bit value, indicates working radix 2^(m)
OUTPUT:
  α²: 2nm-bit value in register R
STEP 1:
  i←0 /* iteration index */
  R←0 /* initialize result register */
  B2←[10...0]₂ /* m-bits with MSB=1 */
  B4←[010...0]₂ /* 2m-bits with MSBs=01 */
  AB←α /* squarand value */
STEP 2:
  B←LSD(AB,m) /* encode b as LSD(AB,1) */
  AB←SHR(AB,m,[0..0]₂) /* MS squarand digits */
STEP 3:
  T1←{AB,B4,[0..0]₂} /* form T1, m LSbs=0 */
  T2←{AB,B2} /* form A+β/2 */
  T3←b×b /* compute single digit square, uses a₀ in B */
STEP 4:
  T2←SHL(T2,1,[0]₂) /* form 2(A+β/2) */
  ACC←T1+T3 /* form T₁+T₃ */
STEP 5:
  T2←T2×b /* form 2(A+β/2)b, uses a₀ in B */
STEP 6:
  ACC←ACC+T2 /* form T₁+T₂+T₃ */
STEP 7:
  R←SHR(R,2m,LSD(ACC,2)) /* update result */
  i←i+1 /* increment iteration counter */
STEP 8:
  if (i=n/m) /* check iteration count */
    HALT /* computation complete */
  else
    GO TO STEP 2

The example algorithm shown above can undergo n/m iterations producing 2m bits of α² during each iterative step. Therefore, the algorithm has temporal complexity equivalent to O(n/m). In terms of required computational resources, the algorithm requires circuitry to perform shifting, bit-string concatenation, 2nm-bit operand addition, m×2nm-bit multiplication, and m-bit operand squaring. While 2nm-bit operand addition operations are required in STEPs 4 and 6, it is noted that a single 2nm-bit addition circuit can be used since these sums may be formed sequentially allowing for reuse of the single 2nm-bit adder. The multiplication and single-digit squaring operations can be implemented in a variety of forms although it is noted that due to the relatively small size of the operands (m bits) very compact and fast circuits such as lookup tables are a practical choice.
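A software rendering of the steps above may look like the following Python sketch. It is illustrative rather than a definitive implementation and rests on two assumptions that are not spelled out in the algorithm statement: n is taken to be the number of radix-2^(m) digits in the squarand (so the loop runs n times), and the bits of ACC above the low 2m bits are carried into the next iteration as the previously accumulated result. The function name digit_serial_square is hypothetical.

```python
def digit_serial_square(alpha, n, m):
    """Square an n-digit radix-2**m value, producing 2m result bits per step."""
    two_m = 2 * m
    width = 2 * n * m
    b2 = 1 << (m - 1)                 # B2 = beta/2
    b4 = b2 * b2                      # B4 = (beta/2)^2
    ab, r, carry = alpha, 0, 0        # STEP 1: AB <- alpha, R <- 0
    for _ in range(n):
        a0 = ab & ((1 << m) - 1)      # STEP 2: B <- LSD(AB, 1)
        ab >>= m                      #         AB <- A/beta
        b = a0 - b2                   # signed residual encoded by a0
        t1 = (ab << two_m) | b4       # STEP 3: T1 by concatenation
        t2 = ((ab << m) | b2) << 1    # STEPs 3-4: 2(A + beta/2)
        t3 = b * b                    # STEP 3: single-digit square
        # STEPs 4-6, plus the assumed carry of the upper ACC bits
        acc = carry + t1 + t3 + t2 * b
        low = acc & ((1 << two_m) - 1)
        r = (r >> two_m) | (low << (width - two_m))   # STEP 7
        carry = acc >> two_m
    return r
```

Under these assumptions the per-step sum T₁+T₂+T₃ equals α²−A² for the current operand and is therefore never negative, so the carried value stays non-negative throughout.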

FIG. 2 is a schematic block diagram of an example squaring module 106 in accordance with the present disclosure. Squaring module 106 can be a hardware circuit composed of analog and digital circuitry. Digital circuitry includes transistor-based logic circuits and components. In some implementations, squaring module 106 can be implemented as a software algorithm. Squaring module 106 can receive as an input the operand α 202, which is a value to be squared (squarand). In this example circuit, a synchronous digital logic circuit uses a quaternary radix, β=2²=4. The operand 202 is received by a multiplexer circuit 204. The computation T3←b×b in STEP 3 above uses a multiplexer-based lookup structure. A 4:1 multiplexer with 2m-bit data paths and an m-bit control signal chooses among the appropriate squared values of b. The squared values b² that drive the multiplexer data inputs are pre-computed before implementation of the circuitry and are either hardwired or stored in registers. Alternatively, a small nonvolatile memory such as a ROM or flash circuit, or a volatile memory such as SRAM or DRAM, could be used with the B register contents driving the address lines and all possible b² values stored in the memory. Register B drives the control lines of the multiplexer and represents the residual value b. It is noted that B actually contains the least significant digit a_(i) as an encoded value for b since b=a_(i)−β/2. To clarify this encoding, Table 2 contains all values of b and the corresponding a_(i) that serves as the m-bit encoded representation of b for the radix-4 quaternary case.

TABLE 2 Encoded Values of b for Radix-4 Number System

a_(i)    b     Encoded b in Register B
0        −2    [00]₂
1        −1    [01]₂
2        0     [10]₂
3        1     [11]₂

The computation of T2←T2×b in STEP 5 of the algorithm is accomplished by using a 4:1 multiplexer as a simple lookup structure with data paths of size 2nm and an m-bit control signal driven by the content of register B. The idea behind this circuit is similar to that of the T3←b×b computation in STEP 3 with the important difference that the possible T2×b values are computed during each iterative step rather than being precomputed and stored before circuit operation. Fortunately, these values are easily and efficiently computed since, for the quaternary implementation, they consist of the value 2(A+β/2) multiplied by only one of b∈{−2, −1, 0, 1}. Thus, a negated version of 2(A+β/2), namely −[2(A+β/2)], and a single-bit left-shifted version of −[2(A+β/2)] are used as well as 2(A+β/2) and [0 . . . 0] to drive the data inputs of the multiplexer. FIG. 4 contains a diagram of this subcircuit.
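The multiplexer-based multiplication for the quaternary case may be modeled as follows. The function name t2_times_b_radix4 is hypothetical, and Python's signed integers stand in for whatever signed representation the circuit uses; the list index plays the role of the two-bit control signal from register B.

```python
def t2_times_b_radix4(t2, a0):
    """Select 2(A + beta/2) * b for radix 4, where b = a0 - 2."""
    neg = -t2                          # negated version of 2(A + beta/2)
    data_inputs = [
        neg << 1,                      # a0 = 0, b = -2: negated, shifted left
        neg,                           # a0 = 1, b = -1: negated
        0,                             # a0 = 2, b =  0: all zeros
        t2,                            # a0 = 3, b = +1: unchanged
    ]
    return data_inputs[a0]
```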

The output of the multiplexer 204 is received by combinational logic 206. Combinational logic 206 includes several outputs: one output is coupled to the input of the multiplexer 204. The other outputs of combinational logic 206 are coupled to an adder array 208. Each of the multiplexer 204, the combinational logic 206, and the adder array 208 also includes as inputs control signals from a clocked synchronous controller (not shown).

The combinational logic 206 may be implemented based on simplifications in the formation of the intermediate terms T₁, T₂, and T₃, and their various sums. These simplifications exploit the choice of using β=4 as an implicit operand radix and allow for the computation of the intermediate terms T₁, T₂, and T₃ to be implemented with a reduced and simplified set of register transfer level (RTL) operations.

A single quaternary digit [a_(k)]₄ can, in general, be written as a two-bit binary string [b_(2k+1)b_(2k)]₂ where b_(i) ∈ 𝔹 and 𝔹={0,1}. Using this definition, various intermediate terms and their sums can be evaluated for different cases of the least significant digit of the squarand, a₀∈{0, 1, 2, 3}. Term T₁ is independent of the value of a₀ and is always a bit string of length 2n+2 expressed as: T₁=[a_(n−1) a_(n−2) . . . a₂ a₁10]₄=[b_(2n−1) b_(2n−2) b_(2n−3) b_(2n−4) . . . b₅ b₄ b₃ b₂0100]₂

Case 1

a₀=[0]₄ resulting in the residual b=[−2]₄, thus T₃=b²=[10]₄=[0100]₂. Term T₂ can be expressed as:

$\begin{aligned} T_{2} &= 2\left( A + \frac{\beta}{2} \right) b \\ &= 2b \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= -[10]_{4} \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= -[a_{n-1} a_{n-2} \ldots a_{2} a_{1} 20]_{4} \end{aligned}$

Combining the terms:

$\begin{aligned} T_{1} + T_{2} + T_{3} &= [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 10]_{4} - [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 20]_{4} + [10]_{4} \\ &= [0 \ldots 0]_{4} \\ &= [00 \ldots 00]_{2} \end{aligned}$

Case 2

a₀=[1]₄ resulting in the residual b=[−1]₄, thus T₃=b²=[01]₄=[0001]₂. Term T₂ can be expressed as:

$\begin{aligned} T_{2} &= 2\left( A + \frac{\beta}{2} \right) b \\ &= 2b \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= -[2]_{4} \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= -[0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 100]_{2} \end{aligned}$

Combining the terms:

$\begin{aligned} T_{1} + T_{2} + T_{3} &= [b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 0100]_{2} - [0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 100]_{2} + [0001]_{2} \\ &= [0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 001]_{2} \end{aligned}$

Case 3

a₀=[2]₄ resulting in the residual b=[0]₄, thus T₃=b²=[00]₄=[0000]₂. Term T₂ can be expressed as:

$\begin{aligned} T_{2} &= 2\left( A + \frac{\beta}{2} \right) b \\ &= 0 \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= [00 \ldots 00]_{2} \end{aligned}$

Combining the terms:

$\begin{aligned} T_{1} + T_{2} + T_{3} &= [b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 0100]_{2} + [00 \ldots 00]_{2} + [0000]_{2} \\ &= [b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 0100]_{2} \end{aligned}$

Case 4

a₀=[3]₄ resulting in the residual b=[1]₄, thus T₃=b²=[01]₄=[0001]₂. Term T₂ can be expressed as:

$\begin{aligned} T_{2} &= 2\left( A + \frac{\beta}{2} \right) b \\ &= 2b \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= [2]_{4} \times [a_{n-1} a_{n-2} \ldots a_{2} a_{1} 2]_{4} \\ &= [0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 100]_{2} \end{aligned}$

For this case, the sum T₂+T₃ can be formed directly and it is subsequently combined with term T₁ using the addition circuit. T₂+T₃ is formed as:

$\begin{aligned} T_{2} + T_{3} &= [0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 100]_{2} + [0001]_{2} \\ &= [0 b_{2n-1} b_{2n-2} \ldots b_{3} b_{2} 101]_{2} \end{aligned}$

Table 3 below contains a summary of the results of the intermediate terms and their various sums in terms of values of the least significant digit of the operand at each iterative step.

TABLE 3 Radix-4 Optimizations

LSD(α₄, 1)    Intermediate Term    Value
0             T₁ + T₂ + T₃         [0...0]₂
1             T₁ + T₂ + T₃         [0b_(2n−1) b_(2n−2) . . . b₃ b₂ 001]₂
2             T₁ + T₂ + T₃         [b_(2n−1) b_(2n−2) . . . b₃ b₂ 0100]₂
3             T₁                   [b_(2n−1) b_(2n−2) . . . b₃ b₂ 0100]₂
3             T₂ + T₃              [0b_(2n−1) b_(2n−2) . . . b₃ b₂ 101]₂

The combinational logic 206 makes use of the results in Table 3 and outputs the two (2n+2)-bit values that are summed in the adder array 208, resulting in T₁+T₂+T₃ (i.e., the combinational logic includes two outputs: one for each input of the adder array). For the cases a₀∈{0, 1, 2}, T₁+T₂+T₃ is formed directly in the combinational logic 206 and is input to the adder array 208 on the left-most input bus, with the right-most input set to the (2n+2)-bit string [00 . . . 00]₂. The adder array 208 performs a non-trivial addition only for the case of a₀=3, where the left-most input is the bit string [b_(2n−1) b_(2n−2) . . . b₃ b₂0100]₂ and the right-most input is [0b_(2n−1) b_(2n−2) . . . b₃ b₂101]₂.
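The radix-4 cases of Table 3 can be checked against the term definitions with a short software model. The sketch below is illustrative only: the function names `radix4_adder_inputs` and `t_terms` are hypothetical, and it assumes that A denotes the operand with its least significant radix-4 digit set to zero, so that the bits b_(2n−1) . . . b₂ are the operand bits above the two least significant bits.

```python
def radix4_adder_inputs(operand: int) -> tuple[int, int]:
    """Model the two adder-array inputs for one radix-4 step, following Table 3.

    Assumption: `upper` holds bits b_(2n-1) ... b_2 of the operand.
    """
    a0 = operand & 0b11        # least significant radix-4 digit
    upper = operand >> 2       # remaining, more significant bits
    if a0 == 0:
        return 0, 0                              # [0...0]
    if a0 == 1:
        return (upper << 3) | 0b001, 0           # [0 b... b2 001]
    if a0 == 2:
        return (upper << 4) | 0b0100, 0          # [b... b2 0100]
    # a0 == 3: T1 on the left-most bus, T2 + T3 on the right-most bus
    return (upper << 4) | 0b0100, (upper << 3) | 0b101


def t_terms(operand: int, beta: int = 4) -> int:
    """Compute T1 + T2 + T3 directly from the definitions in the text."""
    a = operand % beta
    A = operand - a            # assumption: operand with its LS digit zeroed
    b = a - beta // 2          # residual
    t1 = A * beta + (beta // 2) ** 2
    t2 = 2 * (A + beta // 2) * b
    t3 = b * b
    return t1 + t2 + t3


if __name__ == "__main__":
    for value in range(64):
        left, right = radix4_adder_inputs(value)
        assert left + right == t_terms(value)
    print("Table 3 entries agree with T1 + T2 + T3 for all 6-bit operands")
```

Under these assumptions, the sum of the two modeled bus values equals T₁+T₂+T₃ for every operand, which is the quantity the adder array 208 is described as producing.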

Accumulator 210 consists of an internal accumulator register, an internal adder circuit, and a feedback loop that allows the internal adder output to be stored in the internal accumulator register. Accumulator 210 can receive the output of the adder array 208, where it is added to the previously stored value in the accumulator register and then stored back into the accumulator register. A right shift register 212 can receive the output of the accumulator. The size of the register 212 may be 2nm bits, where n is the number of radix-β digits representing α and m denotes the substring size (β=2^(m)). The final operation of each iterative step of the algorithm is to shift the result register 2m bits to the right and insert the 2m least significant bits of T₁+T₂+T₃ into the most significant positions of the shifted result register. Inserting the two radix-2^(m) digits in the most significant portion of the result register, instead of performing a multi-bit left shift before adding them to the previously accumulated result, allows the algorithm to be implemented without a multi-bit left shift operation or a barrel shifting circuit in a hardware realization. After the iterative steps are completed, the square α² 214 can be output.
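One possible software reading of the accumulate-and-shift behavior is sketched below. It is a minimal model under stated assumptions rather than a description of the circuit itself: the function name `square_shift_accumulate` is hypothetical, the sum T₁+T₂+T₃ is computed in its algebraically equivalent form 2Aa+a², and the accumulator is assumed to retain only the bits above the 2m bits handed to the result register.

```python
def square_shift_accumulate(alpha: int, n: int, m: int) -> int:
    """Square an n-digit radix-2**m operand using a modeled 2nm-bit result register."""
    if not 0 <= alpha < (1 << (n * m)):
        raise ValueError("alpha must fit in n radix-2**m digits")
    beta = 1 << m
    width = 2 * n * m                    # size of the result register in bits
    pair_mask = (1 << (2 * m)) - 1       # selects 2m bits (two radix-beta digits)
    acc = 0                              # models the accumulator register
    result = 0                           # models the right shift register
    operand = alpha
    for _ in range(n):
        a = operand & (beta - 1)         # least significant digit of the operand
        A = operand - a                  # assumption: operand with LS digit zeroed
        acc += 2 * A * a + a * a         # equals T1 + T2 + T3 for this step
        low = acc & pair_mask            # 2m bits that are now final
        # Shift right by 2m and insert the new bits at the most significant end.
        result = (result >> (2 * m)) | (low << (width - 2 * m))
        acc >>= 2 * m                    # assumption: carry kept for the next step
        operand >>= m                    # decatenate the digit just processed
    return result


if __name__ == "__main__":
    assert all(square_shift_accumulate(x, 3, 2) == x * x for x in range(64))
    assert all(square_shift_accumulate(x, 2, 3) == x * x for x in range(64))
    print("model matches x * x for all 6-bit operands")
```

In this model no left shift by a variable amount is needed: each pair of result digits enters at the top of the register and reaches its final position through the fixed 2m-bit right shifts of the later iterations.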

FIGS. 3A-3C are example diagrams of a squaring circuit operating on a six bit string using a two bit substring corresponding to a chosen substring length of m=2 (β=2²=4). FIGS. 3A-3C show three iterations used for calculating a value α², where α has a bit length nm=6 and a substring length m=2. In FIG. 3A, the right-most input bits are represented as a₀ and the remaining bits as A₀. The least significant bits a₀ include 2 bits (m=2). Because the bit length of α is 6, this calculation requires 3 iterations since 2×3=6. The bits a₀ are received into the squaring module, and the output b₀ includes 4 bits (2m bits) representing two output digits (shown on the right side of the squaring module). In FIG. 3B, the remaining 4 bits of α are then considered. The least significant bits are represented as a₁ having 2 bits (m=2), and the remaining 2 bits of α are represented as A₁. The value a₁ is received by the squaring module, and the output b₁ includes 4 bits (2m bits) representing two digits of the output value α². Finally, in FIG. 3C, the remaining bits of α are represented as a₂ (which are the most significant bits of the original string). The value a₂ (having 2 bits) is received by the squaring module, and the output b₂ also includes 4 bits (2m bits) representing the two most significant digits of the resultant string α².
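As a hypothetical numeric illustration of the three iterations (the operand value 45 is chosen here for illustration and is not taken from the figures), let A_i denote the operand of iteration i with its least significant radix-4 digit set to zero, which is an assumed reading of the notation. Each iteration contributes its sum T₁+T₂+T₃ at a weight of 4^(2i):

$\begin{aligned} &\alpha = [101101]_{2} = [231]_{4} = 45 \\ &\text{iteration 0: } a_{0} = [1]_{4},\ A_{0} = [230]_{4} = 44,\ T_{1} + T_{2} + T_{3} = 180 - 92 + 1 = 89 \\ &\text{iteration 1: } a_{1} = [3]_{4},\ A_{1} = [20]_{4} = 8,\ T_{1} + T_{2} + T_{3} = 36 + 20 + 1 = 57 \\ &\text{iteration 2: } a_{2} = [2]_{4},\ A_{2} = 0,\ T_{1} + T_{2} + T_{3} = 4 + 0 + 0 = 4 \\ &\alpha^{2} = 89 + 4^{2} \times 57 + 4^{4} \times 4 = 89 + 912 + 1024 = 2025 = [011111101001]_{2} \end{aligned}$

The 12-bit result splits into three 4-bit groups, [0111]₂, [1110]₂, and [1001]₂, consistent with each iteration supplying 2m=4 bits of α².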

FIGS. 4A-4B are example diagrams of a squaring circuit operating on a six bit string using a three bit substring corresponding to a chosen substring length of m=3 (β=2³=8). FIGS. 4A-4B show two iterations used for calculating a value α², where α has a bit length nm=6 and a substring length m=3 since β is chosen to be 2³. In FIG. 4A, the least significant input bits are represented as a₀ and the remaining bits as A₀. The least significant bits a₀ include 3 bits (m=3). Because the bit length of α is 6, this calculation requires 2 iterations since 3×2=6. The bits a₀ are received into the squaring module, and the output c₀ includes 6 bits (2m bits) representing two digits of α² (shown on the right side of the squaring module). In FIG. 4B, the remaining 3 bits of α are then considered. The most significant bits are represented as a₁, comprising 3 bits (m=3). Again, the value a₁ is received by the squaring module, and the output c₁ includes 6 bits (2m bits) representing two digits of the output value α².

FIG. 5A is an example of a portion of a process flow diagram 500 for squaring an input value in accordance with the present disclosure. The example process flow described here is applicable for m=2. An input operand α and a substring size m are received (502). The input operand may be identified as a binary digit string, α (504). The least significant digit (LSD) substring, a, of the substring size m can be determined (506). The LSD substring can be decatenated from the binary digit string to form a word, A (508). A radix β is determined as 2^(m) (510). A residual value, b, is determined as b=a−β/2 (512). T₁ is determined as (A/β)2^(2m)+(β/2)² (514).

FIG. 5B is an example of another portion of the process flow diagram 550 for squaring an input value in accordance with the present disclosure. Continuing from step 514, T₂ is determined as 2(A+β/2)b (516). T₃ is determined as b² (518). The value α² can be determined as (A/β)² 2^(2m)+T₁+T₂+T₃ (520). The resulting value α² can be concatenated with previous results, if any, as the most significant digit having bit length 2m (522). A determination may be made as to whether the length of A is greater than the substring size m (524). If the length of A is less than or equal to m, then the process can iterate one additional time to account for a non-zero residual value and then terminate (526). If the length of A is greater than m, then A can be identified as a word A and as the binary digit string, α (528). Then, the process flows back to point (X) 530, which connects prior to point 506 of FIG. 5A. The process then continues until it terminates based on the condition at point 524.
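The flow of FIGS. 5A-5B can be approximated in software as follows. This is a sketch under assumptions, not the claimed process: the function name `square_by_flow` is hypothetical, A is taken to be the operand with its LSD substring set to zero (so that A/β is the remaining word), the partial results are combined by weighted addition instead of modeling concatenation, and the loop simply runs until no operand bits remain in place of the length test at 524.

```python
def square_by_flow(alpha: int, m: int) -> int:
    """Square a non-negative integer by iterating over m-bit substrings."""
    if alpha < 0 or m < 1:
        raise ValueError("alpha must be non-negative and m must be at least 1")
    beta = 1 << m                                        # step 510: radix
    total = 0
    weight_shift = 0                                     # partial i carries weight beta**(2*i)
    operand = alpha
    while operand > 0:
        a = operand % beta                               # step 506: LSD substring
        A = operand - a                                  # step 508 (assumed: LSD zeroed)
        b = a - beta // 2                                # step 512: residual
        t1 = (A // beta) * (1 << (2 * m)) + (beta // 2) ** 2   # step 514
        t2 = 2 * (A + beta // 2) * b                     # step 516
        t3 = b * b                                       # step 518
        total += (t1 + t2 + t3) << weight_shift          # combine with earlier partials
        weight_shift += 2 * m
        operand >>= m                                    # step 528: A/beta becomes the operand
    return total


if __name__ == "__main__":
    for m in (1, 2, 3, 4):
        assert all(square_by_flow(x, m) == x * x for x in range(1024))
    print("square_by_flow agrees with x * x for m in 1..4")
```

The term (A/β)² 2^(2m) of step 520 does not appear explicitly because it is the square of the next operand scaled by β²; the later loop iterations supply it through the growing weight.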

A number of implementations have been described. Nevertheless, it will be understood that various modifications may be made including portions or the entirety of the implementation in software form. Accordingly, other implementations are within the scope of the following claims. 

What is claimed is:
 1. An apparatus comprising: a squaring circuit configured to: identify a fixed-point value with a fixed word size and a substring size for substrings of the fixed-point value, wherein the fixed-point value comprises a string of digits and an initial operand; for an initial iteration, determine a square of least significant digits of the initial operand; for each subsequent iteration: determine an operand for that iteration by decatenating least significant digits from an operand in a previous iteration, wherein a length of the least significant digits from the operand of the previous iteration is equal to the substring size; and determine a square using the least significant digits of the operand for that iteration and the substring size; and concatenate the squares from each iteration to estimate a square of the fixed-point value.
 2. The apparatus of claim 1, wherein each digit in the string of digits is in base 2.